Importing the relevant Python libraries#
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import csv
import time
import preprocessing
import random
import statsmodels
Table of contents#
Introduction
Datasets and preprocessing
Who knows whom? - Degree distribution
Connectedness (to be written in the final version)
Small world property
Friendships in communities - Triadic closure
Conclusion
References
Appendix
General todo: every visualisation must have a title and axis labels
1. Introduction#
Have you ever wondered how people influence each other? Or how concepts, ideas and facts spread through networks of people? These types of questions about social influence can be answered with the help of network science. The aim of this project is to use network science theory to compare properties of social networks. A network is a web that consists of nodes and edges: people, devices, accounts or other objects are represented as nodes, and the edges represent the relationships between these nodes. Together, all these nodes and edges form a network.
In real life we meet new friends through acquaintances, and it is plausible that social influence also spreads through acquaintances. Fortunately, the network datasets we use for this project also provide node attributes. For music-related networks, these attributes are often genre or artist preferences. Consequently, similarities between entities in a network can be calculated and compared.
The concept of social influence can be translated into the following hypothesis:
If two nodes are similar in the network, then they must have similar node attributes.
Or written in first order logic:
\(GraphSim(x,y)\): x is similar to y in the network
\(AttrSim(x,y)\): x and y have similar attributes
\(\forall x\forall y(GraphSim(x,y)\rightarrow AttrSim(x,y))\)
In order to test this hypothesis, we must first define what similarity is and how to compute it. For similarity in graphs there are two popular measures: the Jaccard similarity and the overlap coefficient.
For attribute similarity, the obvious choice is also the Jaccard similarity. This is a commonly used measure for categorical data, which attributes certainly are.
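As a concrete illustration, here is a minimal sketch of both measures applied to two hypothetical neighbour (or attribute) sets; the sets and function names are made up for this example, while the notebook itself computes the measures at scale with cugraph:

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| — shared elements relative to all elements."""
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap_coefficient(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|) — shared elements relative to the smaller set."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

# Hypothetical neighbour (or attribute) sets of two nodes
neighbours_x = {1, 2, 3, 4}
neighbours_y = {3, 4, 5}

print(jaccard_similarity(neighbours_x, neighbours_y))   # 2 / 5 = 0.4
print(overlap_coefficient(neighbours_x, neighbours_y))  # 2 / 3 ≈ 0.667
```

Note that the overlap coefficient is always at least as large as the Jaccard similarity, since it divides by the smaller set rather than the union.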
Leaping ahead: we were able to successfully compute these similarities on a subset of the networks. By using cugraph in a Google Colab notebook we could use Google’s GPUs for efficient computation. The code used to construct csv files with all the similarity data for the different networks can be found in the appendix. Let’s load the files and see what they can teach us about social influence.
start = time.time() # to measure runtime
similarities_file_paths = {
'FacebookLargepage' : 'datasets/facebook-large-page-page-network/musae_facebook_final_similarity.csv',
'FeatherDeezerSocial' : 'datasets/feather-deezer-social/deezer_europe_final_similarity.csv',
'GithubSocial' : 'datasets/github-social/musae_git_final_similarity.csv',
'TwitchSocialNetworksDE' : 'datasets/twitch-social-networks/DE/musae_DE_final_similarity.csv',
'TwitchSocialNetworksENGB' : 'datasets/twitch-social-networks/ENGB/musae_ENGB_final_similarity.csv',
'TwitchSocialNetworksES' : 'datasets/twitch-social-networks/ES/musae_ES_final_similarity.csv',
'TwitchSocialNetworksFR' : 'datasets/twitch-social-networks/FR/musae_FR_final_similarity.csv',
'TwitchSocialNetworksPTBR' : 'datasets/twitch-social-networks/PTBR/musae_PTBR_final_similarity.csv',
'TwitchSocialNetworksRU' : 'datasets/twitch-social-networks/RU/musae_RU_final_similarity.csv',
}
For this visualisation, the data from all the different networks is aggregated.
frames = []
for path in similarities_file_paths.values():
    frames.append(pd.read_csv(path, index_col=0))
similarity_dataframe = pd.concat(frames)
# Uses qcut to convert to categorical variables
jaccard_coeff_cut, cut_bins = pd.qcut(similarity_dataframe['jaccard_coeff'], 3, retbins=True, labels=["Not Similar", "Average", "Similar"])
overlap_coeff_cut, cut_bins = pd.qcut(similarity_dataframe['overlap_coeff'], 3, retbins=True, labels=["Not Similar", "Average", "Similar"])
jaccard_genre_coeff_cut, cut_bins = pd.qcut(similarity_dataframe['jaccard_genre_sim'], 3, retbins=True, labels=["Not Similar", "Average", "Similar"])
jaccard_coeff_dim = go.parcats.Dimension(
values=jaccard_coeff_cut,
categoryorder='category descending', label="Jaccard coeff"
)
overlap_coeff_dim = go.parcats.Dimension(
values=overlap_coeff_cut,
categoryorder='category descending', label="Overlap coeff"
)
jaccard_genre_coeff_dim = go.parcats.Dimension(
values=jaccard_genre_coeff_cut,
categoryorder='category descending', label="Attribute similarity"
)
sim = go.Figure(data = [go.Parcats(dimensions=[jaccard_coeff_dim, overlap_coeff_dim, jaccard_genre_coeff_dim],
)])
sim.update_layout(title="Graph similarity and attribute similarity")
sim.show()
similarity_dataframe.corr()[['jaccard_coeff', 'overlap_coeff', 'jaccard_genre_sim']].loc[['jaccard_coeff', 'overlap_coeff', 'jaccard_genre_sim']]
| | jaccard_coeff | overlap_coeff | jaccard_genre_sim |
|---|---|---|---|
| jaccard_coeff | 1.000000 | 0.646626 | 0.024894 |
| overlap_coeff | 0.646626 | 1.000000 | 0.001387 |
| jaccard_genre_sim | 0.024894 | 0.001387 | 1.000000 |
(Not) being an influencer - Are these networks even real networks?#
Although it sounds plausible that similar nodes in a graph have similar attributes, the above chart implies that the hypothesis is invalid. The Pearson correlation substantiates this: the correlations between the graph similarity measures and the attribute similarity are close to zero.
This raises the question of whether the above networks are real networks at all. To answer this question, a distinction must be made between random networks and real networks. According to Barabási (2016), a random network consists of N nodes where each node pair is connected with probability p. A random network is also called an Erdős–Rényi network in honour of its creators. Because a random network is constructed by chance, its characteristics are typical of this randomness. Barabási also explains that real networks have significantly different network characteristics, since these networks are not constructed by chance.
The goal of this project is to analyse different social networks along different network properties, through the perspective of both random networks and real networks. With this analysis, it can be determined to what extent the different networks are indeed real networks.
2. Datasets and preprocessing#
This project is based on three different datasets of social networks that can be found on Kaggle (Appendix 2). Each dataset represents a network as a list of edges. Some datasets also include a file with node attributes, often a .json file. The last dataset differs from the first two: it is a dataset containing multiple datasets of social networks.
Before the datasets can be read in, a manual selection must be made of which datasets will and will not be used for the project. This selection is based on the size of the datasets, to reduce the risk of very long processing times. In addition, any disconnected network is excluded from the project to simplify the calculations. The table below gives an overview of all the networks, retrieved from the datasets, that are used in this project. For each network, it lists the node representation, the link representation and, where applicable, the node attributes.
| Dataset | Node representation | Link representation | Node attributes if applicable |
|---|---|---|---|
| NashvilleMeetupNetwork | Member of a Meetup group | Shared group membership in ‘weight’ groups | N/A |
| DeezerHR | Deezer users from Croatia | The friendship relation | Genre preferences of each user |
| DeezerHU | Deezer users from Hungary | The friendship relation | Genre preferences of each user |
| DeezerRO | Deezer users from Romania | The friendship relation | Genre preferences of each user |
| FacebookLargePage | Official Facebook pages | Mutual likes between pages | Descriptions of the purpose of the site |
| FeatherDeezerSocial | Deezer users from Europe | The friendship relation | Artists liked by the users |
| FeatherLastfmSocial | Lastfm users from Asia | The friendship relation | Artists liked by the users |
| GithubSocial | Developers who have starred at least 10 repositories | Mutual following between developers | Location, Repositories starred, Employer and E-mail address |
| TwitchSocialNetworks | Twitch users | The friendship relation | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksDE | Twitch users from Germany | The friendship relation | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksENGB | Twitch users from England | The friendship relation | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksES | Twitch users from Spain | The friendship relation | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksFR | Twitch users from France | The friendship relation | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksPTBR | Twitch users from Brazil | The friendship relation | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksRU | Twitch users from Russia | The friendship relation | Games played and liked, Location and Streaming habits |
| SocEpinions1 | Epinions users | Trust relationship between users | N/A |
| SocSignSlashdot090221 | Slashdot users (2009) | Friend or enemy relationship | N/A |
| SocSlashdot0902 | Slashdot users (2009) | Friend or enemy relationship | N/A |
Preprocessing non-csv datasets#
Some network datasets store their edges in a .txt file, others in a .csv file. In order to read the edges with the pandas read_csv function, the .txt files need to be converted to .csv files. The function used to accomplish this can be found in the preprocessing.py file.
soc-sign-Slashdot090221.txt
soc-Slashdot0902.txt
# preprocessing.transform_text_to_csv('datasets/soc-sign-Slashdot090221.txt', 'datasets/soc-sign-Slashdot090221.csv')
# preprocessing.transform_text_to_csv('datasets/soc-Slashdot0902.txt', 'datasets/soc-Slashdot0902.csv')
Read in datasets with pandas#
The datasets can be easily read in with pandas.
NashvilleMeetupNetwork_member_edges = pd.read_csv('datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
DeezerHR_edges = pd.read_csv('datasets/DeezerSocialNetworks/HR/HR_edges.csv')
DeezerHU_edges = pd.read_csv('datasets/DeezerSocialNetworks/HU/HU_edges.csv')
DeezerRO_edges = pd.read_csv('datasets/DeezerSocialNetworks/RO/RO_edges.csv')
FacebookLargePage_edges = pd.read_csv('datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
FeatherDeezerSocial_edges = pd.read_csv('datasets/feather-deezer-social/deezer_europe_edges.csv')
FeatherLastfmSocial_edges = pd.read_csv('datasets/feather-lastfm-social/lastfm_asia_edges.csv')
GithubSocial_edges = pd.read_csv("datasets/github-social/musae_git_edges.csv")
TwitchSocialNetworksDE_edges = pd.read_csv("datasets/twitch-social-networks/DE/musae_DE_edges.csv")
TwitchSocialNetworksENGB_edges = pd.read_csv("datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
TwitchSocialNetworksES_edges = pd.read_csv("datasets/twitch-social-networks/ES/musae_ES_edges.csv")
TwitchSocialNetworksFR_edges = pd.read_csv("datasets/twitch-social-networks/FR/musae_FR_edges.csv")
TwitchSocialNetworksPTBR_edges = pd.read_csv("datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
TwitchSocialNetworksRU_edges = pd.read_csv("datasets/twitch-social-networks/RU/musae_RU_edges.csv")
SocSignSlashdot090221_edges = pd.read_csv('datasets/soc-sign-Slashdot090221.csv')
SocSlashdot0902_edges = pd.read_csv('datasets/soc-Slashdot0902.csv')
Construct graphs with networkx#
After that, undirected networkx graphs can be constructed from the pandas dataframes.
NashvilleMeetupNetwork = nx.from_pandas_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=nx.Graph)
DeezerHR = nx.from_pandas_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerHU = nx.from_pandas_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerRO = nx.from_pandas_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=nx.Graph)
FacebookLargePage = nx.from_pandas_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=nx.Graph)
FeatherDeezerSocial = nx.from_pandas_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=nx.Graph)
FeatherLastfmSocial = nx.from_pandas_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=nx.Graph)
GithubSocial = nx.from_pandas_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=nx.Graph)
TwitchSocialNetworksDE = nx.from_pandas_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksENGB = nx.from_pandas_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksES = nx.from_pandas_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksFR = nx.from_pandas_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksPTBR = nx.from_pandas_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksRU = nx.from_pandas_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=nx.Graph)
SocSignSlashdot090221 = nx.from_pandas_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=nx.Graph)
SocSlashdot0902 = nx.from_pandas_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=nx.Graph)
Descriptive statistics of the different networks#
The final preprocessing step consists of calculating different network properties for each network. The visualization sections explain in more detail what the different properties mean.
networks = {
'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
'DeezerHR' : DeezerHR,
'DeezerHU' : DeezerHU,
'DeezerRO' : DeezerRO,
'FacebookLargepage' : FacebookLargePage,
'FeatherDeezerSocial' : FeatherDeezerSocial,
'FeatherLastfmSocial' : FeatherLastfmSocial,
'GithubSocial' : GithubSocial,
'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
'TwitchSocialNetworksES' : TwitchSocialNetworksES,
'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
'SocSignSlashdot090221' : SocSignSlashdot090221,
'SocSlashdot0902' : SocSlashdot0902,
}
The code below uses networkx and will take some time to run. That is why the results are written to a csv file that can be read in later.
# start = time.time()
# descriptive_stats = pd.DataFrame(index=list(networks.keys()), columns=['number_of_nodes', 'number_of_edges', 'average_degree', 'density', 'connected_network', 'avg_cc', 'transitivity', 'average_shortest_path_length', 'diameter', 'has_bridges', 'number_of_bridges', 'max_degree'])
# for name, network in networks.items():
#     descriptive_stats['number_of_nodes'].loc[name] = nx.number_of_nodes(network)
#     descriptive_stats['number_of_edges'].loc[name] = nx.number_of_edges(network)
#     descriptive_stats['average_degree'].loc[name] = np.mean(list(dict(network.degree()).values()))
#     descriptive_stats['density'].loc[name] = nx.density(network)
#     descriptive_stats['connected_network'].loc[name] = nx.is_connected(network)
#     descriptive_stats['avg_cc'].loc[name] = nx.average_clustering(network)
#     descriptive_stats['transitivity'].loc[name] = nx.transitivity(network)
#     descriptive_stats['has_bridges'].loc[name] = nx.has_bridges(network)
#     descriptive_stats['number_of_bridges'].loc[name] = len(list(nx.bridges(network)))
#     descriptive_stats['max_degree'].loc[name] = max(dict(nx.degree(network)).values())
# descriptive_stats.to_csv('results/descriptive_stats_of_networks_2.csv')
# end = time.time()
# print(f'time in seconds: {end - start}')
Networkx can be used to calculate many different network properties, as seen above. However, some network properties require more efficient algorithms to compute.
Below this cell is all the code used to calculate additional network properties. This code was run in a Google Colab notebook, because this allows us to use Google’s GPUs for GPU-accelerated computing; the calculation of some network properties requires a lot of processing power. Instead of pandas and networkx, the Python libraries cudf and cugraph were used: similar alternatives built on CUDA.
# import cudf
# import cugraph as cnx
# import time
# import numpy as np
# import json
# import itertools
# # Reading in the data with cudf
# NashvilleMeetupNetwork_member_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
# DeezerHR_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/HR/HR_edges.csv')
# DeezerHU_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/HU/HU_edges.csv')
# DeezerRO_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/RO/RO_edges.csv')
# FacebookLargePage_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
# FeatherDeezerSocial_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/feather-deezer-social/deezer_europe_edges.csv')
# FeatherLastfmSocial_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/feather-lastfm-social/lastfm_asia_edges.csv')
# GithubSocial_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/github-social/musae_git_edges.csv")
# TwitchSocialNetworksDE_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/DE/musae_DE_edges.csv")
# TwitchSocialNetworksENGB_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
# TwitchSocialNetworksES_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/ES/musae_ES_edges.csv")
# TwitchSocialNetworksFR_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/FR/musae_FR_edges.csv")
# TwitchSocialNetworksPTBR_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
# TwitchSocialNetworksRU_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/RU/musae_RU_edges.csv")
# SocSignSlashdot090221_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/soc-sign-Slashdot090221.csv')
# SocSlashdot0902_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/soc-Slashdot0902.csv')
# # Constructing networks with cugraph
# NashvilleMeetupNetwork = cnx.from_cudf_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=cnx.Graph)
# DeezerHR = cnx.from_cudf_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# DeezerHU = cnx.from_cudf_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# DeezerRO = cnx.from_cudf_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# FacebookLargePage = cnx.from_cudf_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=cnx.Graph)
# FeatherDeezerSocial = cnx.from_cudf_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# FeatherLastfmSocial = cnx.from_cudf_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# GithubSocial = cnx.from_cudf_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=cnx.Graph)
# TwitchSocialNetworksDE = cnx.from_cudf_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksENGB = cnx.from_cudf_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksES = cnx.from_cudf_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksFR = cnx.from_cudf_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksPTBR = cnx.from_cudf_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksRU = cnx.from_cudf_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=cnx.Graph)
# SocSignSlashdot090221 = cnx.from_cudf_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=cnx.Graph)
# SocSlashdot0902 = cnx.from_cudf_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=cnx.Graph)
# networks = {
# 'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
# 'DeezerHR' : DeezerHR,
# 'DeezerHU' : DeezerHU,
# 'DeezerRO' : DeezerRO,
# 'FacebookLargepage' : FacebookLargePage,
# 'FeatherDeezerSocial' : FeatherDeezerSocial,
# 'FeatherLastfmSocial' : FeatherLastfmSocial,
# 'GithubSocial' : GithubSocial,
# 'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
# 'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
# 'TwitchSocialNetworksES' : TwitchSocialNetworksES,
# 'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
# 'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
# 'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
# 'SocSignSlashdot090221' : SocSignSlashdot090221,
# 'SocSlashdot0902' : SocSlashdot0902,
# }
# # Code used to compute the average shortest path length and diameter
# start = time.time()
# descriptive_stats = cudf.DataFrame(index=list(networks.keys()), columns=['average_shortest_path_length', 'diameter'])
# for name, network in networks.items():
#     # One BFS per source node: the mean distance gives the per-source average
#     # path length, the maximum distance gives that node's eccentricity
#     mean_distances, eccentricities = [], []
#     for x in network.nodes().to_numpy():
#         distances = cnx.bfs_edges(network, source=x)['distance']
#         mean_distances.append(float(distances.mean()))
#         eccentricities.append(int(distances.max()))
#     descriptive_stats['average_shortest_path_length'].loc[name] = np.mean(mean_distances)
#     descriptive_stats['diameter'].loc[name] = max(eccentricities)
# descriptive_stats.to_csv('/content/drive/MyDrive/Colab Notebooks/results/descriptive_stats_of_networks.csv')
# end = time.time()
# print(f'time in seconds: {end - start}')
Combining the calculations from networkx and cugraph gives the following dataframe:
descriptive_stats = pd.read_csv('results/descriptive_stats_of_networks.csv', index_col=0)
descriptive_stats
| | number_of_nodes | number_of_edges | average_degree | density | connected_network | avg_cc | transitivity | average_shortest_path_length | diameter | has_bridges | number_of_bridges | max_degree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NashvilleMeetupNetwork | 11372 | 1176368 | 206.888498 | 0.018194 | True | 0.884957 | 0.604407 | 4.413648 | 6 | True | 13 | 2356 |
| DeezerHR | 54573 | 498202 | 18.258186 | 0.000335 | True | 0.136477 | 0.114630 | 8.501347 | 12 | True | 2435 | 420 |
| DeezerHU | 47538 | 222887 | 9.377214 | 0.000197 | True | 0.116187 | 0.092924 | 9.930603 | 14 | True | 2869 | 112 |
| DeezerRO | 41773 | 125826 | 6.024274 | 0.000144 | True | 0.091212 | 0.075267 | 12.970340 | 19 | True | 6403 | 112 |
| FacebookLargepage | 22470 | 171002 | 15.220472 | 0.000677 | True | 0.359738 | 0.232321 | 10.443525 | 15 | True | 2973 | 709 |
| FeatherDeezerSocial | 28281 | 92752 | 6.559315 | 0.000232 | True | 0.141160 | 0.095922 | 15.138963 | 21 | True | 6470 | 172 |
| FeatherLastfmSocial | 7624 | 27806 | 7.294334 | 0.000957 | True | 0.219418 | 0.178623 | 10.457240 | 15 | True | 1929 | 216 |
| GithubSocial | 37700 | 289003 | 15.331724 | 0.000407 | True | 0.167537 | 0.012357 | 7.228674 | 11 | True | 5241 | 9458 |
| TwitchSocialNetworksDE | 9498 | 153138 | 32.246368 | 0.003395 | True | 0.200886 | 0.046471 | 5.085176 | 7 | True | 453 | 4259 |
| TwitchSocialNetworksENGB | 7126 | 35324 | 9.914117 | 0.001391 | True | 0.130928 | 0.042433 | 7.001684 | 10 | True | 1222 | 720 |
| TwitchSocialNetworksES | 4648 | 59382 | 25.551635 | 0.005499 | True | 0.222496 | 0.084235 | 6.649312 | 9 | True | 301 | 1022 |
| TwitchSocialNetworksFR | 6549 | 112666 | 34.407085 | 0.005255 | True | 0.221706 | 0.054128 | 5.089327 | 7 | True | 257 | 2040 |
| TwitchSocialNetworksPTBR | 1912 | 31299 | 32.739540 | 0.017132 | True | 0.319895 | 0.130981 | 4.905335 | 7 | True | 116 | 767 |
| TwitchSocialNetworksRU | 4385 | 37304 | 17.014367 | 0.003881 | True | 0.165797 | 0.048648 | 6.288940 | 9 | True | 461 | 1229 |
| SocSignSlashdot090221 | 82140 | 500481 | 12.186048 | 0.000148 | True | 0.058787 | 0.023618 | 8.911346 | 13 | True | 30086 | 2548 |
| SocSlashdot0902 | 82168 | 582533 | 14.179072 | 0.000173 | True | 0.060345 | 0.024109 | 8.910720 | 13 | True | 30035 | 2554 |
3. Who knows whom? - Degree distribution#
Degree distribution of networks.
Firstly, we will discuss everything related to the term ‘degree’ in network theory. The degree is the number of connections (or neighbours) a node has. Degrees play an important role in understanding interactions between nodes and are fundamental for measuring the connectivity or centrality of a node in a network. For instance, a node with a high degree is probably vital in keeping the network stable and connected; in other words, nodes with a high degree have a lot of influence. A node with a significantly high degree is often referred to as a hub. A hub can act as a central point for exchanging information. Looking at real-life situations, we see that people with many connections often have power and influence in a social network; accordingly, a famous person usually has a very high degree. On the other hand, chances are that people who live quite remotely have a low degree.
It is known that a random network has a binomial degree distribution (Barabási, 2016). By analysing the degree distribution of the different networks, we can take the first steps towards categorizing the networks as real or random. If the degree distribution of a network is binomial, the network is better described by the random network model. If not, it is more likely to be a real network.
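This claim can be checked on a small Erdős–Rényi graph: each node’s degree is a draw from Binomial(N−1, p), so the mean degree should be close to p(N−1) and the standard deviation close to √((N−1)p(1−p)). The parameters below are chosen purely for illustration:

```python
import numpy as np
import networkx as nx

N, p = 2000, 0.005
er = nx.erdos_renyi_graph(N, p, seed=0)
degrees = np.array([d for _, d in er.degree()])

print(degrees.mean())  # expected close to p * (N - 1) ≈ 10.0
print(degrees.std())   # expected close to sqrt((N - 1) * p * (1 - p)) ≈ 3.15
```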
degree_dist = go.Figure()
for network_name, network in networks.items():
    degree_sequence = sorted((d for n, d in network.degree()), reverse=True)
    degree_dist.add_trace(go.Scatter(x=degree_sequence,
                                     y=np.arange(len(degree_sequence)),
                                     mode='lines',
                                     name=network_name))
degree_dist.update_layout(title="Degree distribution of all networks",
                          xaxis_title="Degree",
                          yaxis_title="Nodes",
                          xaxis_type="log"
                          )
degree_dist.show()
As the above chart shows, the degree distribution is far from binomial. The x-axis is even made logarithmic to account for the heavily skewed distribution of the degrees. Thus, considering the degree distribution, all observed networks are most likely real networks.
4. Connectedness#
A short piece of text about connectedness in real networks; no visualisation needed. Still to be written for the final version.
5. Small world property#
Only a few steps are needed to reach any arbitrary node in the network.
Some definitions
If it is possible to follow links from one node to another node in the network, there is said to be a path between the two nodes. The path length is the number of links in a path. Between the same two nodes there can be multiple paths; the path with the fewest links is called the shortest path. The distance between two nodes is then defined as the length of the shortest path between them, also called the shortest path length.
In this project the focus lies on networks as a whole rather than on individual nodes in a network. For this reason, we investigate how close or far apart we expect nodes in a network to be. To achieve this, another network property is introduced: the average shortest path length. The average shortest path length of a network is calculated by averaging the shortest path lengths over all pairs of nodes.
\(a =\sum_{\substack{s,t \in V \\ s\neq t}} \frac{d(s, t)}{n(n-1)}\)
where V is the set of nodes in G,
d(s, t) is the shortest path from s to t,
and n is the number of nodes in G.
Another network property related to shortest paths, is the diameter, defined as the maximum shortest path length across all pairs of nodes.
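These definitions can be illustrated with networkx on a small example graph (made up purely for illustration):

```python
import networkx as nx

# A 5-node cycle: 0-1-2-3-4-0
G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)])

print(nx.shortest_path_length(G, 0, 3))    # distance 2, via the 0-4 link
print(nx.average_shortest_path_length(G))  # the quantity a defined above: 1.5
print(nx.diameter(G))                      # maximum shortest path length: 2
```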
Small world property
The small world property means that only a few steps are needed to reach any arbitrary node in the network. In other words, two individuals anywhere in the world can be connected through a few acquaintances. This relationship can be expressed mathematically: according to Menczer et al. (2020), the average path length scales logarithmically with the size of the network.
\(a \approx log(n)\)
where n is the number of nodes in G.
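This scaling can be sketched empirically: for Erdős–Rényi graphs with a fixed mean degree of about 10, the average shortest path length grows only slowly as N increases, and for these parameters it stays well below log(N). The sizes and seed below are chosen only for illustration:

```python
import math
import networkx as nx

for N in (200, 400, 800):
    G = nx.erdos_renyi_graph(N, 10 / N, seed=1)
    # Restrict to the largest connected component so path lengths are defined
    G = G.subgraph(max(nx.connected_components(G), key=len))
    a = nx.average_shortest_path_length(G)
    print(N, round(a, 2), round(math.log(N), 2))
```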
This raises the question of whether or not this small world property is true on the networks that will be analysed.
swp = px.scatter(descriptive_stats,
x="number_of_nodes",
y="average_shortest_path_length",
size = "diameter",
trendline= "ols",
trendline_options=dict(log_x=True),
color = "diameter",
title="Relation between average path length and number of nodes",
hover_name = descriptive_stats.index)
swp.update_layout(
xaxis = dict(
title = 'Amount of nodes'),
yaxis = dict(
title = 'Average length between each node')
)
swp.show()
According to Barabási, the small-world property is the only property of real networks that the random network model can also explain. Thus, this plot alone cannot clearly distinguish real networks from random ones. However, the takeaways are, firstly, that for some networks the average path length indeed scales logarithmically with the size of the network; secondly, that only a handful of networks have a short average path length compared to their number of nodes. Unsurprisingly, these are also the networks with the smallest diameter.
6. Friendships in communities - Triadic closure#
Meet new friends through shared contacts.
In a social network, if A and B are both friends with C, then it is quite possible that A and B are friends with each other: there is a good chance that a friend of my friend is also my friend. The connectivity among the neighbours of a node is important because it shows how tightly clustered the nodes are. The clustering coefficient is the fraction of a node’s neighbour pairs that are connected. In this project we focus more on networks in their entirety than on individual nodes, so we chose to examine the average clustering coefficient of our networks. Networks with a low average clustering coefficient have very few triangles; networks with a high average clustering coefficient have many. A triangle is a set of three nodes (a triad) in which each pair of nodes is connected. When we meet people through shared contacts, we close triangles. The abundance of closed triangles is explained by a phenomenon called triadic closure, which plays a crucial role in the formation and maintenance of communities within networks.
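The clustering coefficient and triangle count can be illustrated on a toy graph in which A and B share the contact C (the node names are hypothetical):

```python
import networkx as nx

G = nx.Graph([('A', 'C'), ('B', 'C'), ('A', 'B'), ('C', 'D')])

# C has three neighbours (A, B, D); of the three possible pairs among them,
# only A-B is connected, so C's clustering coefficient is 1/3
print(nx.clustering(G, 'C'))     # 0.333...
print(nx.triangles(G, 'C'))      # C is part of 1 triangle (A-B-C)
print(nx.average_clustering(G))  # averaged over all four nodes
```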
In conclusion, the average clustering coefficient reflects how communities are built and maintained, while the average degree shows how dense the communities in a social network are. This raises an interesting question about the relationship between the average clustering coefficient and the average degree. Computing the correlation between these two variables over all datasets shows a strong positive correlation of 0.930455. We follow up with a figure that gives insight into what makes the correlation between the various networks so strong. The corresponding figure is a bubble chart using three variables: avg_cc, average_degree and number_of_nodes.
descriptive_stats['average_degree'].corr(descriptive_stats['avg_cc'])
0.9304554018401265
bubble = px.scatter(descriptive_stats,
                    x = 'average_degree',
                    y = 'avg_cc',
                    size = 'number_of_nodes',
                    hover_name = descriptive_stats.index,
                    color = descriptive_stats.index,
                    log_x = True, size_max = 60)
bubble.update_layout(
title = 'Average clustering coefficient per average degree',
xaxis = dict(
title = 'Average degree',
gridcolor = 'white',
type = 'log',
gridwidth = 2,
),
yaxis = dict(
title = 'Average clustering coefficient',
gridcolor = 'white',
gridwidth = 2,
),
paper_bgcolor = 'rgb(243, 243, 243)',
plot_bgcolor = 'rgb(243, 243, 243)',
)
bubble.show()
colors = px.colors.qualitative.Plotly[:len(networks)]
avg_cc = px.bar(descriptive_stats, x=list(networks.keys()), y='avg_cc', color=list(networks.keys()), color_discrete_sequence=colors, title='Average clustering coefficient for all networks')
avg_cc.update_layout(xaxis_title='Networks', yaxis_title='Average clustering coefficient')
avg_cc.show()
Sketch visualisation#
According to Barabasi (2016), in a real network high-degree nodes tend to have a smaller clustering coefficient than low-degree nodes. We would like to plot this for all networks, but currently we can plot it only for NashvilleMeetupNetwork. NashvilleMeetupNetwork is chosen first because it is the network with the highest average clustering coefficient.
In order to minimize the loading time of the jupyter notebook, the computations are done in advance and written to a csv file, which then can be loaded in.
# degree_dict = dict(NashvilleMeetupNetwork.degree())
# cc_dict = dict(nx.clustering(NashvilleMeetupNetwork))
# data = {'degree': degree_dict, 'cc' : cc_dict}
# NashVilleMeetupNetwork_degree_cc_ = pd.DataFrame(data)
# NashVilleMeetupNetwork_degree_cc_.to_csv('results/NashvilleMeetupNetwork_degree_vs_cc.csv')
NashVilleMeetupNetwork_degree_cc_ = pd.read_csv('results/NashvilleMeetupNetwork_degree_vs_cc.csv', index_col=0)
fig = px.scatter(NashVilleMeetupNetwork_degree_cc_, x="degree", y="cc", title="Plotsketch: NashvilleMeetup clustering coefficient vs degree")
fig.show()
NashVilleMeetupNetwork_degree_cc_.corr()
| | degree | cc |
|---|---|---|
| degree | 1.000000 | -0.645953 |
| cc | -0.645953 | 1.000000 |
Indeed, this looks like a real network. A Pearson correlation of -0.65 further substantiates this hypothesis.
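Barabasi's claim is usually visualised as the C(k) curve: the mean clustering coefficient per degree value. A sketch of that computation, on toy data mirroring the columns of the csv above (in the notebook, `df` would be `NashVilleMeetupNetwork_degree_cc_`):

```python
import pandas as pd

# Toy degree/clustering table with the same columns as the precomputed csv
df = pd.DataFrame({'degree': [2, 2, 5, 5, 10],
                   'cc':     [1.0, 0.8, 0.4, 0.6, 0.1]})

# C(k): mean clustering coefficient for each degree value
c_of_k = df.groupby('degree')['cc'].mean()
print(c_of_k)  # degree 2 -> 0.9, 5 -> 0.5, 10 -> 0.1: C(k) falls with k
```

Plotting `c_of_k` on log-log axes is the standard way to eyeball whether clustering decays with degree, as it does in real networks.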
7. Conclusion#
Given that real networks with the small-world property tend to have a high clustering coefficient, it can be concluded that NashvilleMeetupNetwork is the only network that satisfies all real-network properties. Thus, NashvilleMeetupNetwork is the only network that qualifies as a real network.
This raises the question of what the other networks are and why they lack high clustering coefficients. NashvilleMeetupNetwork is the only network whose edges represent real-world interactions rather than interactions in a digital environment. We therefore assume that the way humans interact digitally differs fundamentally from the way humans interact in real life. In the end, we are social creatures.
8. References#
Barabasi, A. L. (2016). Network Science. Cambridge: Cambridge University Press. http://networksciencebook.com/
Menczer, F., Fortunato, S., & Davis, C. (2020). A First Course in Network Science. Cambridge: Cambridge University Press. doi:10.1017/9781108653947
Appendix 1#
Code used to compute similarities in the different networks.
(All node attribute files for the different datasets are structured differently. Given the short time available for this project and our focus on the visualisations, we did not prioritise writing code that would work on all network datasets. That is why only a subset of the networks can be used for this analysis.)
# def write_json_to_edge_list(json_path, edge_path):
# with open(json_path, 'r') as file:
# json_object = json.loads(file.read())
# # Create empty lists for sources and targets
# sources = []
# targets = []
# # Process the JSON data
# for source, genres in json_object.items():
# for genre in genres:
# sources.append(source)
# targets.append(genre)
# # Create the cuDF DataFrame
# df = cudf.DataFrame({'source': sources, 'target': targets})
# df.to_csv(edge_path)
# return
# def write_jaccard_similarity_to_csv(edge_path, jaccard_sim_path):
# G_edges = cudf.read_csv(edge_path, index_col=0, dtype='str')
# G_nodes = G_edges['source'].unique()
# # print(G_nodes)
# G = cnx.from_cudf_edgelist(G_edges, 'source', 'target', create_using=cnx.Graph)
# Jaccard = cnx.jaccard(G)
# print(G.number_of_nodes())
# # display(Jaccard)
# # J2 = Jaccard[Jaccard['first'].isin(G_nodes) & Jaccard['second'].isin(G_nodes)]
# # display(J2)
# # J3 = J2[J2['first'] != J2['second']]
# # J4 = J3.sort_values(by='jaccard_coeff', ascending=False)
# # display(J4)
# Jaccard = Jaccard.rename(columns={'jaccard_coeff': 'jaccard_genre_sim'})
# Jaccard.to_csv(jaccard_sim_path)
# return
# def similarity_comparison(edge_path, jaccard_sim_path, network, final_sim_path):
# # write_jaccard_similarity_to_csv(edge_path, jaccard_sim_path)
# jaccard_genre_sim = cudf.read_csv(jaccard_sim_path, index_col=0)
# # display(jaccard_genre_sim)
# Jaccard = cnx.jaccard(network)
# Overlap = cnx.overlap(network)
# J_O_join = Jaccard.merge(Overlap, how='inner')
#     final = J_O_join.merge(jaccard_genre_sim, on=['first', 'second'], how='left')
# final = final.dropna()
# final.to_csv(final_sim_path)
# return
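For readers without a GPU, the same two similarity measures can be sketched with plain networkx instead of cuGraph, on a toy graph: Jaccard similarity via `nx.jaccard_coefficient`, and the overlap coefficient computed by hand (networkx has no built-in for it).

```python
import networkx as nx

# Toy undirected graph standing in for one of the social networks
G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)])

def overlap_coefficient(G, u, v):
    # |N(u) & N(v)| / min(|N(u)|, |N(v)|)
    nu, nv = set(G[u]), set(G[v])
    return len(nu & nv) / min(len(nu), len(nv))

# nx.jaccard_coefficient yields (u, v, similarity) triples for given pairs
for u, v, jac in nx.jaccard_coefficient(G, [(1, 4), (2, 3)]):
    print(u, v, round(jac, 3), round(overlap_coefficient(G, u, v), 3))
```

The overlap coefficient is always at least as large as the Jaccard similarity, since it divides by the smaller neighbourhood instead of the union.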
Appendix 2#
Nashville https://www.kaggle.com/datasets/stkbailey/nashville-meetup?select=rsvps.csv
Deezer https://www.kaggle.com/datasets/andreagarritano/deezer-social-networks
Social Graphs https://www.kaggle.com/datasets/wolfram77/graphs-social
# Useful for keeping an eye on correlations during the project
descriptive_stats.corr()
| | number_of_nodes | number_of_edges | average_degree | density | connected_network | avg_cc | transitivity | average_shortest_path_length | diameter | has_bridges | number_of_bridges | max_degree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| number_of_nodes | 1.000000 | 0.405829 | -0.243899 | -0.497133 | NaN | -0.438849 | -0.275756 | 0.404563 | 0.434850 | NaN | 0.857008 | 0.127392 |
| number_of_edges | 0.405829 | 1.000000 | 0.771754 | 0.352650 | NaN | 0.590780 | 0.655455 | -0.220672 | -0.216164 | NaN | 0.327638 | 0.214918 |
| average_degree | -0.243899 | 0.771754 | 1.000000 | 0.760696 | NaN | 0.930455 | 0.886746 | -0.478397 | -0.495127 | NaN | -0.212021 | 0.104513 |
| density | -0.497133 | 0.352650 | 0.760696 | 1.000000 | NaN | 0.789591 | 0.651239 | -0.631171 | -0.648358 | NaN | -0.355927 | -0.028833 |
| connected_network | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| avg_cc | -0.438849 | 0.590780 | 0.930455 | 0.789591 | NaN | 1.000000 | 0.946820 | -0.427280 | -0.446112 | NaN | -0.394952 | 0.026952 |
| transitivity | -0.275756 | 0.655455 | 0.886746 | 0.651239 | NaN | 0.946820 | 1.000000 | -0.188503 | -0.208922 | NaN | -0.300336 | -0.159003 |
| average_shortest_path_length | 0.404563 | -0.220672 | -0.478397 | -0.631171 | NaN | -0.427280 | -0.188503 | 1.000000 | 0.996961 | NaN | 0.270928 | -0.359790 |
| diameter | 0.434850 | -0.216164 | -0.495127 | -0.648358 | NaN | -0.446112 | -0.208922 | 0.996961 | 1.000000 | NaN | 0.300537 | -0.320203 |
| has_bridges | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| number_of_bridges | 0.857008 | 0.327638 | -0.212021 | -0.355927 | NaN | -0.394952 | -0.300336 | 0.270928 | 0.300537 | NaN | 1.000000 | 0.143028 |
| max_degree | 0.127392 | 0.214918 | 0.104513 | -0.028833 | NaN | 0.026952 | -0.159003 | -0.359790 | -0.320203 | NaN | 0.143028 | 1.000000 |
# only accurate after a full "Restart & Run All"
end = time.time()
time_in_seconds = end-start
print(f"Time to run the notebook: {time_in_seconds}")
Time to run the notebook: 8.466467142105103